Skip to content

fix(downloads): keep DownloadMonitorService alive when a poll cycle hits a transient SQLite lock#654

Open
kevinheneveld wants to merge 1 commit into
Listenarrs:canaryfrom
kevinheneveld:fix/download-monitor-resilience
Open

fix(downloads): keep DownloadMonitorService alive when a poll cycle hits a transient SQLite lock#654
kevinheneveld wants to merge 1 commit into
Listenarrs:canaryfrom
kevinheneveld:fix/download-monitor-resilience

Conversation

@kevinheneveld

Copy link
Copy Markdown
Contributor

Summary

DownloadMonitorService's poll loop caught only OperationCanceledException, so any other exception thrown during a cycle escaped ExecuteAsync and tripped the .NET host's default BackgroundServiceExceptionBehavior.StopHost — shutting the whole app down. Under a container restart policy that manifests as a crash-loop. This makes the monitor loop log-and-continue on a failed cycle, matching every sibling background service.

Why this matters

A transient Microsoft.Data.Sqlite.SqliteException ("database is locked") is a normal, recoverable event under concurrent writes — it's raised while persisting a download (e.g. EfDownloadRepository.UpdateAsync/GetByIdAsync). With the old loop, a single such exception didn't just fail one cycle; it took the entire host down. On an instance running a container restart policy this became a crash-loop, restarting roughly every poll cycle (observed ~18 restarts in 9 hours), and each restart re-ran startup work — so a recoverable lock turned into sustained downtime and wasted CPU.

DownloadMonitorService was the lone background service with this gap: QueueMonitorService, MovedDownloadProcessor, and AutomaticSearchService already log-and-continue.

Implementation notes

  • Extracted the poll loop into internal async Task RunMonitorLoopAsync(CancellationToken) so the resilience behavior is isolated and unit-testable (the public ExecuteAsync keeps its 5s polling interval, which a test can't wait on).
  • The loop now distinguishes three cases per cycle:
    • OperationCanceledException with cancellation requested → break (graceful shutdown, unchanged).
    • OperationCanceledException without cancellation → a transient timeout within a cycle; logged and the loop continues.
    • Any other exception → logged ("Error in Download Monitor Service") and the loop continues. OutOfMemoryException/StackOverflowException are deliberately not swallowed.
  • Surface area: one application-layer service file (DownloadMonitorService.cs) plus a new test. No API, schema, or behavioral change to what the monitor does on a successful cycle.
  • Backwards compatible: identical behavior on the happy path and on cancellation; the only change is that a failed cycle no longer terminates the host.

What this PR does NOT include

  • The underlying SQLite write contention. The connection string sets no busy_timeout, so concurrent writes can still raise transient locks — this PR stops a single transient lock from being fatal, but reducing how often it occurs is a separate hardening change.

Test plan

  • cd tests && dotnet test — 683 / 683 passing (1 new regression test: RunMonitorLoopAsync_CycleThrows_KeepsLoopingInsteadOfStoppingHost drives the extracted loop, forces a cycle to throw, and asserts the loop task is still running rather than completed).
  • No frontend changes (backend-only).

…sient SQLite lock

DownloadMonitorService's poll loop caught only OperationCanceledException, so any
other exception from a cycle propagated out of ExecuteAsync. A BackgroundService
that throws trips the .NET host's default BackgroundServiceExceptionBehavior.StopHost,
so the host shut down gracefully (exit 0) and the container's unless-stopped policy
restarted it. In practice the cycle threw a Microsoft.Data.Sqlite "database is locked"
SqliteException while persisting a download (EfDownloadRepository.UpdateAsync/
GetByIdAsync) under write contention, restarting the whole app every poll cycle
(~18 restarts in 9h live). Each restart re-ran startup scans and reset the
stall-reaper's in-memory 60-minute timers so it never reaped, keeping CPU pegged.

The sibling background services (QueueMonitorService, MovedDownloadProcessor,
AutomaticSearchService) already log-and-continue on a failed cycle. The poll loop
is extracted to RunMonitorLoopAsync (for isolation + a fast, parallel-safe unit
test) and now swallows-and-continues on any non-cancellation, non-fatal exception
instead of letting it stop the host. The underlying lock contention (no busy_timeout
on the SQLite connection) is a separate hardening follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kevinheneveld kevinheneveld marked this pull request as ready for review June 6, 2026 20:45
@kevinheneveld kevinheneveld requested a review from a team June 6, 2026 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants